Customer churn, also known as customer turnover, occurs when a customer chooses to leave or unsubscribe from a service for any reason. In the context of this project, we are looking at customer churn in banking. Being able to predict customer churn is important because we want to retain as many customers as we can. (Kaemingk, 2018.)
Of course you want to keep customers, but it also is cheaper to retain your customers, than having to spend money to acquire new customers to replace the ones that left. Lower customer acquisition costs equal more profits. By identifying customers at risk of churn, we are able to identify which efforts in order to maximize their likelihood of staying. (Guliyev et al., 2021.)
We chose to analyze a multinational bank’s churn for several reasons. As an international team, we were interested in topics that spanned multiple countries, with a particular focus on either business or environment, our personal interests. When we were pitching topics, nearly every suggestion was geared towards sales or customer experience, or global environmental issues.
The aim of this project is to identify the most useful model for predicting churn churn, across three countries serviced by a multinational bank.
From our EDA in the mid term we found that more than 50% of the customer are based in France with majority of customer of average age group of 39 being in relation with bank for more than a year,51% of active customers while Male customers are predominantly higher than Female customers and also 71% of the customer uses credit card with average credit score of 650 by this analysis we found that 20% of the customer are churned,hence we decided to predict what factors are significantly affecting the customer churn in multinational bank.The model identified can then be used to inform banks of what customer activity trends and attributes should be monitored so the bank can make efforts to retain the customers before the account has been closed.
This study is based on a data set that contains 10,000 observations of bank customers with 12 variables. (See below for a readout of the dataset’s structure and variable names.) Variable descriptions are as follows and we choose the data set from Kaggle (Topre, 2022.)
## 'data.frame': 10000 obs. of 12 variables:
## $ customer_id : int 15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
## $ credit_score : int 619 608 502 699 850 645 822 376 501 684 ...
## $ country : chr "France" "Spain" "France" "France" ...
## $ gender : chr "Female" "Female" "Female" "Female" ...
## $ age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ balance : num 0 83808 159661 0 125511 ...
## $ products_number : int 1 1 3 2 1 2 2 4 2 1 ...
## $ credit_card : int 1 0 1 0 1 1 1 1 0 1 ...
## $ active_member : int 1 1 0 0 1 0 1 0 1 1 ...
## $ estimated_salary: num 101349 112543 113932 93827 79084 ...
## $ churn : int 1 0 1 0 0 1 0 1 0 0 ...
In preparation for exploratory data analysis, we took several steps to clean the data. We immediately dropped the customer_id variable, as we do not need the account holder’s unique identifier for our purpose. After dropping customer_id our dataset had 11 variables:
## 'data.frame': 10000 obs. of 11 variables:
## $ credit_score : int 619 608 502 699 850 645 822 376 501 684 ...
## $ country : chr "France" "Spain" "France" "France" ...
## $ gender : chr "Female" "Female" "Female" "Female" ...
## $ age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ balance : num 0 83808 159661 0 125511 ...
## $ products_number : int 1 1 3 2 1 2 2 4 2 1 ...
## $ credit_card : int 1 0 1 0 1 1 1 1 0 1 ...
## $ active_member : int 1 1 0 0 1 0 1 0 1 1 ...
## $ estimated_salary: num 101349 112543 113932 93827 79084 ...
## $ churn : int 1 0 1 0 0 1 0 1 0 0 ...
Next, we checked for duplicate records in the data set,we found no duplicates in the data set and also checked for null variables, and 0 were found.We then converted the following variables into categorical variables:credit_card, active_member, churn, gender, tenure, product number, age.We also converted Boolean values (0,1) into character format for Credit_card (credit_card,no-credit card), active_members (Active,In Active) and churn (Churned, Retained) variables in the data set, in order to understand each variable in the data set clearly during plotting and analysis.
| credit_card | active_member | churn |
|---|---|---|
| Credit Card | Active | Churned |
| No-Credit Card | Active | Retained |
| Credit Card | In Active | Churned |
| No-Credit Card | In Active | Retained |
| Credit Card | Active | Retained |
| Credit Card | In Active | Churned |
Finally, we checked our continuous variables for outliers using the outlierKD function.
## Outliers identified: 15
## Proportion (%) of outliers: 0.2
## Mean of the outliers: 361
## Mean without removing outliers: 651
## Mean if we remove outliers: 651
## Outliers successfully removed
## Outliers identified: 359
## Proportion (%) of outliers: 3.7
## Mean of the outliers: 69.3
## Mean without removing outliers: 38.9
## Mean if we remove outliers: 37.8
## Outliers successfully removed
By using the outlierKD function we can observe that outliers were
found only in age and credit_score variables
(age: 3.7% and credit_score: 0.2%), so we decided to remove these two
variables using the outlierKD function.
In this section we are going to analyze each variable in the Bank churn dataset by using plots along with finding calculated mean, SD and percentages for each variable.
The average credit score of the customer is 650.529,most of the customers having credit score fall between 600 to 700 and standard deviation is 96.953. The below histogram shows the range of credit scores.
The customers are grouped by the countries in which they have their accounts. As we see from the plot, France has more than 50% of customer accounts which is the highest among all other countries with Germany and Spain sharing equal percentages.Below, bar plot represents the number of customers in three different countries.
The majority of the bank customers fall below the age of 50 with average age of 39 and with standard deviation of 10.5.
Most of the customers have been with the bank for more than a year.
The bank’s customers are predominantly male which make up 55% of the customer base with the females making up the remaining 45%.
48.5% of customers are being inactive.
Most of the customers use product 1 which is 50% and product 4 is the least used with 0.6%.
Predominantly 71% of the Bank customers use credit card and only 29% do not.
The bank managed to retain 80% of their customers with the remaining 20% where churned out.
So, now the big question was what factors affected the 20% churn rate in the multinational bank across different countries. We utilized chi-squared tests, two-sample T-tests, and corrplots in order to find which factors influenced customer churn:
##
## Pearson's product-moment correlation
##
## data: churn_data$balance and as.numeric(churn_data$churn)
## t = 12, df = 9998, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0992 0.1378
## sample estimates:
## cor
## 0.119
From the correlation plot we observe that age is highly correlated with churn.After combining different columns and finding the correlation score we see that it starts decreasing from 0.285
## [1] 0.285
## [1] 0.272
## [1] 0.283
## [1] 0.119
## [1] 0.269
## [1] 0.119
In our EDA, we found out that our binary target variable, churn, was imbalanced with 7963 instances of 0/Retained customers, and 2037 instances of 1/Churned customers.
With 79.6% of customers 0/Retained, and 20.45 of customers 1/Churned, we needed to balance our dataset and bring this number closer to a 50/50 split for the purpose of our training set. We utilized the ovun.sample function from ROSE package to undersample the data, and result in a more even split. With the new proportions, we are ready to train-test-split and proceed:
Before we work on PCA, we need to ensure that the data is scaled and all the variables are either numeric or int type.
## 'data.frame': 10000 obs. of 11 variables:
## $ credit_score : int 619 608 502 699 850 645 822 376 501 684 ...
## $ country : chr "France" "Spain" "France" "France" ...
## $ gender : chr "Female" "Female" "Female" "Female" ...
## $ age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ balance : num 0 83808 159661 0 125511 ...
## $ products_number : int 1 1 3 2 1 2 2 4 2 1 ...
## $ credit_card : chr "Credit Card" "No-Credit Card" "Credit Card" "No-Credit Card" ...
## $ active_member : chr "Active" "Active" "In Active" "In Active" ...
## $ estimated_salary: num 101349 112543 113932 93827 79084 ...
## $ churn : chr "1" "0" "1" "0" ...
To perform dimensiontionality reduction while preserving as much as randomness in the high dimensional space as possible.
Now we will identify correlations in the data
## 'data.frame': 10000 obs. of 11 variables:
## $ credit_score : num 608 699 850 822 684 497 476 549 616 549 ...
## $ country : num 0 2 0 0 2 2 0 1 0 0 ...
## $ gender : num 1 1 1 1 1 0 0 1 0 0 ...
## $ age : int 41 39 43 50 27 24 34 25 45 24 ...
## $ tenure : int 1 1 2 7 2 3 10 5 3 9 ...
## $ balance : num 83808 0 125511 0 134604 ...
## $ products_number : int 1 2 1 2 1 2 2 2 2 2 ...
## $ credit_card : num 1 0 1 0 1 1 1 1 0 1 ...
## $ active_member : num 1 1 0 0 1 0 1 0 1 1 ...
## $ estimated_salary: num 112543 93827 79084 10063 71726 ...
## $ churn : num 0 0 0 0 0 0 0 0 0 0 ...
## 'data.frame': 10000 obs. of 10 variables:
## $ credit_score : num 608 699 850 822 684 497 476 549 616 549 ...
## $ country : num 0 2 0 0 2 2 0 1 0 0 ...
## $ gender : num 1 1 1 1 1 0 0 1 0 0 ...
## $ age : int 41 39 43 50 27 24 34 25 45 24 ...
## $ tenure : int 1 1 2 7 2 3 10 5 3 9 ...
## $ balance : num 83808 0 125511 0 134604 ...
## $ products_number : int 1 2 1 2 1 2 2 2 2 2 ...
## $ credit_card : num 1 0 1 0 1 1 1 1 0 1 ...
## $ active_member : num 1 1 0 0 1 0 1 0 1 1 ...
## $ estimated_salary: num 112543 93827 79084 10063 71726 ...
| credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | |
|---|---|---|---|---|---|---|---|---|---|---|
| credit_score | 1.0000 | 0.0049 | -0.0049 | 0.0062 | 0.0109 | 0.0076 | -0.0191 | -0.0013 | -0.0069 | -0.0142 |
| country | 0.0049 | 1.0000 | -0.0047 | 0.0061 | 0.0026 | -0.0035 | -0.0082 | -0.0085 | 0.0067 | 0.0039 |
| gender | -0.0049 | -0.0047 | 1.0000 | -0.0027 | -0.0076 | -0.0049 | -0.0026 | -0.0058 | -0.0225 | -0.0047 |
| age | 0.0062 | 0.0061 | -0.0027 | 1.0000 | -0.0160 | 0.0649 | -0.0511 | -0.0112 | -0.0092 | -0.0031 |
| tenure | 0.0109 | 0.0026 | -0.0076 | -0.0160 | 1.0000 | 0.0199 | 0.0139 | -0.0085 | -0.0067 | 0.0188 |
| balance | 0.0076 | -0.0035 | -0.0049 | 0.0649 | 0.0199 | 1.0000 | -0.2042 | -0.0047 | 0.0045 | 0.0007 |
| products_number | -0.0191 | -0.0082 | -0.0026 | -0.0511 | 0.0139 | -0.2042 | 1.0000 | 0.0160 | -0.0002 | 0.0139 |
| credit_card | -0.0013 | -0.0085 | -0.0058 | -0.0112 | -0.0085 | -0.0047 | 0.0160 | 1.0000 | -0.0119 | 0.0003 |
| active_member | -0.0069 | 0.0067 | -0.0225 | -0.0092 | -0.0067 | 0.0045 | -0.0002 | -0.0119 | 1.0000 | 0.0054 |
| estimated_salary | -0.0142 | 0.0039 | -0.0047 | -0.0031 | 0.0188 | 0.0007 | 0.0139 | 0.0003 | 0.0054 | 1.0000 |
| credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | |
|---|---|---|---|---|---|---|---|---|---|---|
| credit_score | 1.0000 | 0.0049 | -0.0049 | 0.0062 | 0.0109 | 0.0076 | -0.0191 | -0.0013 | -0.0069 | -0.0142 |
| country | 0.0049 | 1.0000 | -0.0047 | 0.0061 | 0.0026 | -0.0035 | -0.0082 | -0.0085 | 0.0067 | 0.0039 |
| gender | -0.0049 | -0.0047 | 1.0000 | -0.0027 | -0.0076 | -0.0049 | -0.0026 | -0.0058 | -0.0225 | -0.0047 |
| age | 0.0062 | 0.0061 | -0.0027 | 1.0000 | -0.0160 | 0.0649 | -0.0511 | -0.0112 | -0.0092 | -0.0031 |
| tenure | 0.0109 | 0.0026 | -0.0076 | -0.0160 | 1.0000 | 0.0199 | 0.0139 | -0.0085 | -0.0067 | 0.0188 |
| balance | 0.0076 | -0.0035 | -0.0049 | 0.0649 | 0.0199 | 1.0000 | -0.2042 | -0.0047 | 0.0045 | 0.0007 |
| products_number | -0.0191 | -0.0082 | -0.0026 | -0.0511 | 0.0139 | -0.2042 | 1.0000 | 0.0160 | -0.0002 | 0.0139 |
| credit_card | -0.0013 | -0.0085 | -0.0058 | -0.0112 | -0.0085 | -0.0047 | 0.0160 | 1.0000 | -0.0119 | 0.0003 |
| active_member | -0.0069 | 0.0067 | -0.0225 | -0.0092 | -0.0067 | 0.0045 | -0.0002 | -0.0119 | 1.0000 | 0.0054 |
| estimated_salary | -0.0142 | 0.0039 | -0.0047 | -0.0031 | 0.0188 | 0.0007 | 0.0139 | 0.0003 | 0.0054 | 1.0000 |
##
## Call:
## lm(formula = churn ~ country + age + gender, data = churn_data_pc_target)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.256 -0.385 -0.227 0.478 0.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.191973 0.019323 -9.93 <2e-16 ***
## country -0.005029 0.005644 -0.89 0.37
## age 0.015744 0.000436 36.10 <2e-16 ***
## gender -0.011436 0.009380 -1.22 0.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.467 on 9996 degrees of freedom
## Multiple R-squared: 0.116, Adjusted R-squared: 0.115
## F-statistic: 435 on 3 and 9996 DF, p-value: <2e-16
From the results of linear regression model we observe that the variables Country$Germany,age,and Gender are highly significant.While gender is negatively impacting the customer churn but customer with certain age group and from Germany affecting the churn rate positively.
##
## Call:
## lm(formula = churn ~ products_number + estimated_salary + balance +
## active_member, data = churn_data_pc_target)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.633 -0.465 -0.328 0.515 0.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.80e-01 1.81e-02 20.99 < 2e-16 ***
## products_number -3.41e-02 7.68e-03 -4.44 9.1e-06 ***
## estimated_salary 1.38e-07 8.58e-08 1.61 0.11
## balance 1.13e-06 8.13e-08 13.85 < 2e-16 ***
## active_member 1.39e-02 9.81e-03 1.41 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 9995 degrees of freedom
## Multiple R-squared: 0.0246, Adjusted R-squared: 0.0242
## F-statistic: 62.9 on 4 and 9995 DF, p-value: <2e-16
From the results of linear regression model we observe that the variables account balance,account status are highly significant.while the account status impacting the customer churn negatively,but account balance is affecting positively.
## [1] "Case: z-score/scaled"
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## Standard deviation 1.112 1.017 1.009 1.006 1.004 0.9975 0.990 0.9859 0.9757
## Proportion of Variance 0.124 0.103 0.102 0.101 0.101 0.0995 0.098 0.0972 0.0952
## Cumulative Proportion 0.124 0.227 0.329 0.430 0.531 0.6305 0.729 0.8257 0.9209
## PC10
## Standard deviation 0.8892
## Proportion of Variance 0.0791
## Cumulative Proportion 1.0000
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## credit_score 0.08686 -0.0497 0.26077 -0.72578 0.2115 -0.0281 0.110
## country 0.02565 0.2682 -0.12200 -0.36430 -0.3858 0.7019 -0.337
## gender -0.01024 -0.4806 0.23490 0.16361 -0.5333 -0.0996 -0.419
## age 0.33121 -0.1171 -0.13614 0.01863 -0.1288 0.2761 0.618
## tenure -0.00162 0.4256 0.68282 -0.08285 0.0277 -0.1706 -0.107
## balance 0.66322 0.0683 0.08162 0.12278 0.0842 -0.0486 -0.147
## products_number -0.65915 0.0158 0.00953 -0.03794 -0.0293 0.0428 0.194
## credit_card -0.07459 -0.2234 0.06652 0.21656 0.6926 0.4630 -0.337
## active_member 0.00213 0.4912 -0.53758 -0.00862 0.0642 -0.3091 -0.301
## estimated_salary -0.04565 0.4513 0.27747 0.49301 -0.1136 0.2778 0.204
## PC8 PC9 PC10
## credit_score -0.5781 -0.0409 0.0415
## country 0.1486 -0.0273 0.0411
## gender -0.3499 0.3077 0.0176
## age -0.0162 0.6149 -0.0648
## tenure 0.3348 0.4220 -0.1192
## balance 0.0370 -0.0671 0.7048
## products_number 0.0100 0.2143 0.6910
## credit_card -0.0897 0.2749 -0.0439
## active_member -0.3233 0.4178 -0.0209
## estimated_salary -0.5430 -0.2150 -0.0354
Correlation can vary from -1 to +1 Value close to +1 indicates a high degree of positive correlation Value close to -1 indicates a high degree of negative correlation Value close to zero or 0 indicates no correlation at all
## [1] 7034
##
## 0 1
## 0.795 0.205
From the Biplot between PC1 and PC2 , first of all, it shows the data points plotted (projection) onto the PC1-PC2 plane, with the scales on the bottom and the left tick marks. It also shows the components on this PC1-PC2 plane, with the scales on the top and right tick marks showing the loadings.
PC1 has more variation than PC2
We have also plotted the graphs in between PC2-PC3, AND PC3-PC4: Observe that PC2 has more variation than PC3 and PC3 has more variation than PC4
Here, all components capture the majority of the variability.
From the graph for Proportion of variance we can see that almost 90% of variance is explained by 9 principal components.
Feature selection using Exhaustive Search
Using exhaustive search age,balance were selected which is a two variable model with adjusted R^2 value 0.14.
The best model selected using BIC is credit_score,age,balance which is a 3 variable model with BIC value -1500.
The best model selected using Cp is credit_score,age,tenure,balance,estimated_salary which is a 5 variable model. with Cp value 7.
Mallow Cp Plot The Mallow Cp plot selected two best models.
1)credit_score,country,gender,age,tenure,balance,active_member,estimated salary 8 variable model
2)credit_score,country,gender,age,tenure,balance,estimated_salary.
Feature selection using forward search
The best model selected using forward search is age,balance which is a 2 variable model with adjusted R^2 value of 0.14.
The best model selected using BIC is credit_card,age,balance which is a 3 variable model with value -1500 Cp.
The best model selected using Cp is credit_score,age,tenure,balance,estimated_salary which is a 5 variable model with Cp value 7.
Feature selection using backward search
The best model selected using backward search is age,balance which is a 2 variable model with adjusted R^2 value of 0.14.
The best model selected using BIC is credit_card,age,balance which is a 3 variable model with value -1500 Cp.
The best model selected using Cp is credit_score,age,tenure,balance,estimated_salary which is a 5 variable model with Cp value 7.
Feature selection using Sequential Replacement
The best model selected using sequential search is age,balance which is a 2 variable model with adjusted R^2 value of 0.14.
The best model selected using BIC is credit_card,age,balance,estimated_salary which is a 4 variable model with value -1500 Cp.
The best model selected using Cp is credit_score,age,tenure,balance,estimated_salary which is a 5 variable model with Cp value 7.
The best models selected using feature selection methods are
1 Age, balance(2 variable model)
Forward search: Adj R^2,
Exhaustive search: Adj R^2
Backward search: Adj R^2
Sequential search:Adj R^2
2 Credit_score,age,balance(3 variable model)
Forward search: BIC
Exhaustive search: BIC
Backward search: BIC
Sequential search:BIC
3 Credit_score,age,tenure,balance,estimated_salary(5 variable model)
Exhaustive search: CP
Forward search: CP
Backward search: CP
Sequential search:CP
4.Credit_score, country, gender, age, tenure, balance, estimated_salary(7 variable model)
5 Credit_score,country,gender,age,tenure,balance,active_member,estimated_salary(8 variable model)
When evaluating the models, we are focusing particularly on False negatives because it is vital for any bank to correctly predict churn and in an ideal scenario without any errors. False negatives are of highest importance as we do not want to incorrectly predict that the customers are retained, when they are in fact churned.
The second way we will be evaluating the models is by Area Under the Curve. The higher this number is, the better our model.
Model with all variables:
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.705 -0.959 -0.663 1.089 2.206
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.97e+00 1.97e-01 -15.04 < 2e-16 ***
## credit_score -6.47e-04 2.22e-04 -2.92 0.00354 **
## country -2.39e-02 2.62e-02 -0.91 0.36143
## gender -4.81e-02 4.35e-02 -1.11 0.26867
## age 7.10e-02 2.27e-03 31.31 < 2e-16 ***
## tenure -2.17e-04 7.49e-03 -0.03 0.97688
## balance 4.53e-06 3.62e-07 12.51 < 2e-16 ***
## products_number -1.15e-01 3.34e-02 -3.43 0.00061 ***
## credit_card -2.26e-02 4.75e-02 -0.48 0.63478
## active_member 7.66e-02 4.34e-02 1.77 0.07744 .
## estimated_salary 6.48e-07 3.79e-07 1.71 0.08750 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13722 on 9999 degrees of freedom
## Residual deviance: 12297 on 9989 degrees of freedom
## AIC: 12319
##
## Number of Fisher Scoring iterations: 4
Model 1 - churn ~ age + balance Looking at the Adjusted R square results we can see that we get a maximum value of 0.14 with 2 variables namely age and balance for exhaustive search method. If we look at the variables that give the same value of 0.14 for Adjusted R square for forward, backward and sequential search we have the same variables age and balance. Therefore, we build model1 with parameters age and balance.
##
## Call:
## glm(formula = churn ~ age + balance, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.275 -0.379 -0.193 0.462 0.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.68e-01 1.89e-02 -14.2 <2e-16 ***
## age 1.54e-02 4.33e-04 35.5 <2e-16 ***
## balance 1.03e-06 7.52e-08 13.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.214)
##
## Null deviance: 2465.0 on 9999 degrees of freedom
## Residual deviance: 2140.8 on 9997 degrees of freedom
## AIC: 12973
##
## Number of Fisher Scoring iterations: 2
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 4557 | 1035 | 5592 |
| Actual 1 | 2089 | 2319 | 4408 |
| Total | 6646 | 3354 | 10000 |
## Area under the curve: 0.734
Model 2 - churn ~ credit_card+age + balance Similarly, for the second model looking at the BIC values we can see that from feature selection we have a minimum value of -1500 from exhaustive search method with 3 variables namely credit_score, age and balance. We can see the same variables associated in both forward search and backward search. For sequential search, we have an additional variable estimated_salary along with the other three variables. Therefore, if you take the variables common in all these search techniques we are left with credit_score, age and balance so we build model2 with these variables.
##
## Call:
## glm(formula = churn ~ credit_card + age + balance, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.273 -0.380 -0.194 0.461 0.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.64e-01 2.03e-02 -13.05 <2e-16 ***
## credit_card -4.87e-03 1.02e-02 -0.48 0.63
## age 1.54e-02 4.33e-04 35.46 <2e-16 ***
## balance 1.03e-06 7.52e-08 13.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.214)
##
## Null deviance: 2465.0 on 9999 degrees of freedom
## Residual deviance: 2140.8 on 9996 degrees of freedom
## AIC: 12975
##
## Number of Fisher Scoring iterations: 2
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 4571 | 1021 | 5592 |
| Actual 1 | 2100 | 2308 | 4408 |
| Total | 6671 | 3329 | 10000 |
## Area under the curve: 0.734
Model 3 - churn ~ credit_score+age+tenure+ balance + active_member+estimated_salary Now for the third model we repeat the same procedure but we use the Cp metric. We have a minimum value of 7 for Cp with the variables credit_score,age,tenure,balance and estimated_salary for the exhaustive search method. We have the same variables associated with Cp for forward, backward and sequential search. Therefore, we choose model3 with features credit_score,age,tenure,balance and estimated_salary
##
## Call:
## glm(formula = churn ~ credit_score + age + tenure + balance +
## estimated_salary, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.270 -0.382 -0.195 0.461 1.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.91e-01 3.78e-02 -5.07 4.1e-07 ***
## credit_score -1.39e-04 4.74e-05 -2.94 0.0033 **
## age 1.54e-02 4.33e-04 35.50 < 2e-16 ***
## tenure -1.70e-04 1.60e-03 -0.11 0.9153
## balance 1.03e-06 7.52e-08 13.68 < 2e-16 ***
## estimated_salary 1.39e-07 8.09e-08 1.72 0.0861 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.214)
##
## Null deviance: 2465.0 on 9999 degrees of freedom
## Residual deviance: 2138.3 on 9994 degrees of freedom
## AIC: 12967
##
## Number of Fisher Scoring iterations: 2
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 4566 | 1026 | 5592 |
| Actual 1 | 2072 | 2336 | 4408 |
| Total | 6638 | 3362 | 10000 |
## Area under the curve: 0.734
Model 4 - churn ~ credit_score + country+ gender+age+tenure+balance+estimated_salary For model4, we need to look at Mallow Cp plot in order to understand the feature selection method used for this model. We can see that for a value of 7 for Mallow Cp we have two models overlapping so, we can’t quite figure out which model to be chosen. The next best value for Mallow Cp is 7.5 with the following features credit_score,country,gender,age,tenure,balance,estimated_salary. We select these features for model4.
##
## Call:
## glm(formula = churn ~ credit_score + country + gender + age +
## tenure + balance + estimated_salary, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.278 -0.381 -0.196 0.461 1.009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.83e-01 3.82e-02 -4.78 1.8e-06 ***
## credit_score -1.39e-04 4.74e-05 -2.94 0.0033 **
## country -4.69e-03 5.59e-03 -0.84 0.4016
## gender -1.09e-02 9.29e-03 -1.17 0.2409
## age 1.54e-02 4.33e-04 35.50 < 2e-16 ***
## tenure -1.80e-04 1.60e-03 -0.11 0.9102
## balance 1.03e-06 7.52e-08 13.67 < 2e-16 ***
## estimated_salary 1.39e-07 8.09e-08 1.71 0.0865 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.214)
##
## Null deviance: 2465.0 on 9999 degrees of freedom
## Residual deviance: 2137.9 on 9992 degrees of freedom
## AIC: 12969
##
## Number of Fisher Scoring iterations: 2
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 4567 | 1025 | 5592 |
| Actual 1 | 2056 | 2352 | 4408 |
| Total | 6623 | 3377 | 10000 |
## Area under the curve: 0.734
Model 5 - churn ~ credit_score+country+gender+age+tenure+balance+active_member+estimated_salary
##
## Call:
## glm(formula = churn ~ credit_score + country + gender + age +
## tenure + balance + active_member + estimated_salary, data = churn_data_logit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.271 -0.381 -0.196 0.460 1.008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.92e-01 3.86e-02 -4.98 6.6e-07 ***
## credit_score -1.39e-04 4.74e-05 -2.93 0.0034 **
## country -4.76e-03 5.59e-03 -0.85 0.3948
## gender -1.05e-02 9.29e-03 -1.13 0.2576
## age 1.54e-02 4.33e-04 35.52 < 2e-16 ***
## tenure -1.60e-04 1.60e-03 -0.10 0.9204
## balance 1.03e-06 7.52e-08 13.66 < 2e-16 ***
## active_member 1.66e-02 9.26e-03 1.79 0.0733 .
## estimated_salary 1.38e-07 8.09e-08 1.70 0.0882 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.214)
##
## Null deviance: 2465.0 on 9999 degrees of freedom
## Residual deviance: 2137.2 on 9991 degrees of freedom
## AIC: 12968
##
## Number of Fisher Scoring iterations: 2
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 4577 | 1015 | 5592 |
| Actual 1 | 2062 | 2346 | 4408 |
| Total | 6639 | 3361 | 10000 |
## Area under the curve: 0.734
Model 4 - Credit_score, country, gender, age, tenure, balance, active_member,estimated_salary
When creating a model based solely on customer demographics, only the age variable is significant. On the other hand based on salary and account features (number of products, balance, active status), only the balance variable is significant.
In our analysis, 90% of variance explained by 9 principle components with a PC9 Cumulative proportion of variance is 0.92.
Best model Adj R^2: Age + Balance, BIC: Credit Score + Age + Balance, Cp: Credit Score + Age + Balance + Tenure + Est. Salary
Best predictor of customer churn over all: Model 5, which includes features Credit_score + country + gender + age + tenure + balance + active_member + estimated_salary is best performing which has comparatively lower False Negatives (1994) and Area under curve (0.745). The reason for choosing False Negatives is that we are okay with the model predicting churn for a customer when it is actually retained. However, if a customer is actually churned and our model is predicting that it is retained then it is a problem and the errors in classifications must be highlighted as it is very critical for the bank to correctly identify churned customers. Therefore, the verdict for the best models is decided based on the which model has lower False Negatives.
Kaemingk, D. (2018, August 29). Reducing customer churn for banks and financial institutions. Qualtrics. Retrieved November 2, 2022, from https://www.qualtrics.com/blog/customer-churn-banking/
Guliyev, H., & Yerdelen Tatoğlu, F. (2021). Customer churn analysis in banking sector: Evidence from explainable machine learning models. Journal of Applied Microeconometrics, 1(2), 85–99. https://doi.org/10.53753/jame.1.2.03